Yesterday we covered how to set up the environment and download the dataset; today we'll look at how the text gets processed.
Since the model can't be trained on raw text directly, the text needs some preprocessing first.
The text has to be converted into numbers first; one way of doing this is described on this website.
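Before using the pretrained converter, here is a toy sketch of the basic idea, just to illustrate what "converting text into numbers" means. The vocabulary and sentence below are made up purely for illustration and have nothing to do with the real tokenizer downloaded in the next step.
# Toy illustration only: map each word of a sentence to an integer ID via a small hand-made vocabulary.
vocab = {'[START]': 2, '[END]': 3, 'but': 4, 'what': 5, 'if': 6, 'it': 7, 'were': 8, 'active': 9, '?': 10}

def toy_tokenize(sentence):
    # Wrap the sentence with start/end IDs, the same convention the real tokenizer uses.
    return [vocab['[START]']] + [vocab[word] for word in sentence.split()] + [vocab['[END]']]

print(toy_tokenize('but what if it were active ?'))
# [2, 4, 5, 6, 7, 8, 9, 10, 3]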
Here we first download the saved tokenizer (converter) model so it can be loaded:
import tensorflow as tf

# Download and unzip the pretrained pt→en tokenizer model into the working directory.
model_name = "ted_hrlr_translate_pt_en_converter"
tf.keras.utils.get_file(
    f"{model_name}.zip",
    f"https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip",
    cache_dir='.', cache_subdir='', extract=True
)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/models/ted_hrlr_translate_pt_en_converter.zip
188416/184801 [==============================] - 0s 0us/step
196608/184801 [===============================] - 0s 0us/step
'./ted_hrlr_translate_pt_en_converter.zip'
Next, have TensorFlow load the downloaded model:
tokenizers = tf.saved_model.load(model_name)
Using dir(), we can see which methods tokenizers makes available:
[item for item in dir(tokenizers.en) if not item.startswith('_')]
['detokenize',
'get_reserved_tokens',
'get_vocab_path',
'get_vocab_size',
'lookup',
'tokenize',
'tokenizer',
'vocab']
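Besides tokenize, detokenize and lookup, which are shown below, a couple of the other methods are handy for inspecting the tokenizer. The values mentioned in the comments are rough assumptions and may differ depending on the model version:
# Size of the English vocabulary used by this tokenizer (a scalar tensor, roughly 7000 entries for this model).
print(tokenizers.en.get_vocab_size())

# The reserved special tokens, e.g. [PAD], [UNK], [START], [END].
print(tokenizers.en.get_reserved_tokens())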
tokenize is the method that converts a string into token IDs.
for en in en_examples.numpy():
    print(en.decode('utf-8'))
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .
en_examples holds the first three sentences of the dataset we split up yesterday.
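As a reminder, en_examples comes from yesterday's data loading step. A minimal sketch of how it can be produced, assuming the TED pt→en dataset is loaded through tensorflow_datasets as in the official tutorial, looks roughly like this:
import tensorflow_datasets as tfds

# Load the Portuguese→English TED talks dataset as (pt, en) sentence pairs.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)
train_examples = examples['train']

# Take one batch of three sentence pairs; en_examples then holds the three English sentences.
for pt_examples, en_examples in train_examples.batch(3).take(1):
    break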
encoded = tokenizers.en.tokenize(en_examples)
for row in encoded.to_list():
    print(row)
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]
Here you can see what the data looks like after conversion: the 2 at the very front marks the start of the sentence and the 3 at the very end marks the end.
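If you want to double-check which tokens those IDs stand for, one quick way is to look at the reserved tokens; that the start and end markers sit at positions 2 and 3 is an assumption based on the output above:
# Reserved tokens are listed in ID order, so indices 2 and 3 should be [START] and [END].
reserved = tokenizers.en.get_reserved_tokens()
print(reserved[2], reserved[3])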
detokenize converts the IDs back into text.
round_trip = tokenizers.en.detokenize(encoded)
for line in round_trip.numpy():
    print(line.decode('utf-8'))
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .
lookup converts token IDs into the corresponding token strings.
tokens = tokenizers.en.lookup(encoded)
tokens
<tf.RaggedTensor [[b'[START]', b'and', b'when', b'you', b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take', b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is', b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]'], [b'[START]', b'but', b'what', b'if', b'it', b'were', b'active', b'?', b'[END]'], [b'[START]', b'but', b'they', b'did', b'n', b"'", b't', b'test', b'for', b'curiosity', b'.', b'[END]']]>
From the tokens above you can see that this method exposes the subword pieces that longer words are split into (the pieces prefixed with ##, such as 'search' + '##ability').
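If you want to stitch these subword tokens back into a readable string without going through detokenize, one rough way is to join them with spaces; using tf.strings.reduce_join here is my own choice, not something the converter requires:
# Join the subword tokens of each sentence into a single space-separated string.
joined = tf.strings.reduce_join(tokens, separator=' ', axis=-1)
for line in joined.numpy():
    print(line.decode('utf-8'))
# e.g. "[START] but what if it were active ? [END]"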